This repository has been archived by the owner on May 22, 2023. It is now read-only.

[USMP] Initial implementation of liveness analysis for Relax + TIR #250

Open
wants to merge 170 commits into base: relax

Conversation

gigiblender
Contributor

@gigiblender gigiblender commented Sep 15, 2022

This PR adds an initial implementation of liveness analysis of tensors/buffers for Relax and TIR programs.

@areusch @mbaret @YuchenJin @mikepapadim

@YuchenJin
Collaborator

Thanks @gigiblender for integrating USMP into Relax!

One idea about the liveness analysis pass: we can have a memory lifting pass which lifts the memory allocations in TIR into Relax first, and this will allow the liveness analysis pass to analyze only the Relax functions without the need to analyze the TIR Primfuncs in the IRModule. Would love to hear your thoughts. 😄

And one suggestion for the test case construction, we encourage developers to use the block_builder and emit_te api to construct the IRModule if the TVMScript is very long, for example: https://github.com/tlc-pack/relax/blob/relax/tests/python/relax/test_transform_fuse_ops.py#L51-L57. This will make the test case more concise.

@areusch
Contributor

areusch commented Sep 21, 2022

thanks @YuchenJin !

> One idea about the liveness analysis pass: we can have a memory lifting pass which lifts the memory allocations in TIR into Relax first, and this will allow the liveness analysis pass to analyze only the Relax functions without the need to analyze the TIR Primfuncs in the IRModule. Would love to hear your thoughts. 😄

one challenge we have with lifting allocs is that if a TIR PrimFunc has two internal allocs which don't overlap, then we wouldn't be able to detect that solely by looking at `relax.builtin.alloc_tensor` calls. However, I think that we might want to iterate on this PR to derive liveness based on first/last usage rather than just alloc nodes, so maybe this is less of a concern.
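
A minimal sketch of what "liveness based on first/last usage" could look like, not the code in this PR. It assumes a linearized list of statements, each reduced to the set of buffer names it touches; the names `liveness_intervals`, `conflicts`, and the example buffers are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]


def liveness_intervals(stmts: List[Set[str]]) -> Dict[str, Interval]:
    """Map each buffer to (first_use, last_use) over a linear statement order."""
    intervals: Dict[str, Interval] = {}
    for idx, used in enumerate(stmts):
        for buf in used:
            first, _ = intervals.get(buf, (idx, idx))
            intervals[buf] = (first, idx)
    return intervals


def conflicts(intervals: Dict[str, Interval]) -> Set[Tuple[str, str]]:
    """Return the pairs of buffers whose live ranges overlap."""
    names = sorted(intervals)
    result = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (a0, a1), (b0, b1) = intervals[a], intervals[b]
            if a0 <= b1 and b0 <= a1:
                result.add((a, b))
    return result


# Hypothetical PrimFunc body: tmp_a is last used before tmp_b is first used.
body = [{"x", "tmp_a"}, {"tmp_a", "y"}, {"y", "tmp_b"}, {"tmp_b", "out"}]
iv = liveness_intervals(body)
overlapping = conflicts(iv)
```

Here `tmp_a` and `tmp_b` do not conflict, so a planner could place them at the same offset. That fact is only visible when the statements inside the PrimFunc are inspected, which is the concern raised above.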

@YuchenJin
Collaborator

> one challenge we have with lifting allocs is that if a TIR PrimFunc has two internal allocs which don't overlap, then we wouldn't be able to detect that solely by looking at `relax.builtin.alloc_tensor` calls.

Thanks @areusch! If we run the MetaSchedule tuning pass or other transformations/schedules first (which is usually the case, since memory planning happens at a later stage of compilation), the temporary allocs inside a TIR PrimFunc will get removed, so usually there will not be multiple temporary allocs in a TIR PrimFunc. Would love to know the cases where there are several temporary allocs.

@areusch
Contributor

areusch commented Sep 22, 2022

hm, i was thinking that you would see this case when doing multi-anchor fusion. I haven't explored that enough yet to know, though. it does seem like there isn't anything in TIR preventing this case from happening though, and if folks are writing custom TIR passes, it might not be sufficient to rely on MetaSchedule to reuse Buffers in TIR. with that said, this might not be as high of a priority if MetaSchedule does do this.

I'm not sure resolving this question changes the approach of modifying the LivenessAnalysis to generate alloc/kill events based on usage. However, it's certainly a good thing to understand further.

@YuchenJin
Collaborator

> hm, i was thinking that you would see this case when doing multi-anchor fusion. I haven't explored that enough yet to know, though. it does seem like there isn't anything in TIR preventing this case from happening though, and if folks are writing custom TIR passes, it might not be sufficient to rely on MetaSchedule to reuse Buffers in TIR. with that said, this might not be as high of a priority if MetaSchedule does do this.
>
> I'm not sure resolving this question changes the approach of modifying the LivenessAnalysis to generate alloc/kill events based on usage. However, it's certainly a good thing to understand further.

Yes, I agree it does not change the general approach. My thought is if there are usually not multiple temporary allocs in a TIR PrimFunc, the liveness analysis pass would just need to traverse the Relax function after memory lifting, which would simplify the assumption and reduce the complexity of the liveness analysis pass by a lot. :)

tqchen and others added 24 commits October 14, 2022 12:49
Co-Authored-By: Yuchen Jin <[email protected]>
Co-authored-by: ZihengJiang <[email protected]>
* Implementation of call_dps.

* Implementation of PackedFuncExpr.

* Test CallDPS for TIR function.

* Rename.

* Add header and comments.

* Update.

* Address comments.
* Update AST.

* ShapeOf.

* ShapeOf.

* Address comment.
* Add initial IRBuilder.

* Add function output to irbuilder; update based on new AST.

* Add call method; clean up bindings

* Add test.

* Add multifunction test

* Move implementation to C++; infer shape and type

* update op python hook

* More tests and bug fix

* Add comments.

* Update shape/type inference.

* Restructure code; add python type hint.

* Cleanup code.

* Rebase; address comments.

* Add call intrinsic.

* nits.

* Remove call op.

* Migrate scope to C++ using tvm::With.

* Address naming.

* Add GetBlocks API.

* Unify EmitOutput APIs; add more comments.

* Remove shape and type deduction code.

* Also remove the shape/type attr interface.

* Address comments.

* Differentiate global and local function.

* Reset counter after building func/block.

* Rebase.

* Remove shape infer builtin.

* Return from void function as empty tuple.

Co-authored-by: Michalis Papadimitriou <[email protected]>
* Copy jared's frontend

* Remove some extraneous code + add TODOs

* Skeleton AST

* Added more skeleton AST, worked on parsing shape annotations. Something is wrong with span_to_span

* Fix spans

* Type annotations parsing correctly

* some match_shape support

* More bug fixes! Some stuff parses. Importing into tests is messed up. We probably need to restructure this code as well.

* refactor parser and fill out more stubs

* some parser tests

* yolo dataflow

* checkpoint for rebase

* hook up AST

* add inline TIR parsing

* some cleanup

* support call_packed parsing to ExternFunc call

* remove stub ops

* improve docstrings

* address nits

* support coercing tuples to ShapeExpr when possible for call_dps

Co-authored-by: electriclilies <[email protected]>
* Shape and type deduction.

* Fix header.

* Add call attrs to the deduce signature.

* Address comments.

* Add DiagnosticContext to IRBuilder and inference signature.

* Fix nits.
* Relax pretty printer initial prototype

* call into TVMScriptPrinter for PrimFuncs

* most round-trip tests pass

* address comments

* fix typo
…c-pack#9)

* Relax pretty printer initial prototype

* call into TVMScriptPrinter for PrimFuncs

* most round-trip tests pass

* address comments

* implement relax.output syntax for dataflow block outputs

* remove leftover comments

* fix Var constructor on ShapeExpr annotation

* fix DataflowVar as well
* Update MatchShape AST Node.

* Update.

* Update.
* Relax pretty printer initial prototype

* call into TVMScriptPrinter for PrimFuncs

* most round-trip tests pass

* address comments

* implement relax.output syntax for dataflow block outputs

* remove leftover comments

* fix Var constructor on ShapeExpr annotation

* add printing and parsing for simple PrimExpr and Call Attrs
* ExprVisitor/ExprMutator for relax nodes.

* Update Visitor & Mutator.

* Update Mutator.

* DataflowMutator interface.

* EwiseFMARewriter.

* Update fma rewrite and add test.

* Update test.

* Fix dataflow block dispatching.

* Construct new dataflow block with IRBuilder.

* VisitBinding return void and mutate internal IRBuilder.

* Simplify.

* Update emit dataflow output.

* Explicit memory allocation rewrite.

* LazyIRBuilder.

* Update ExplicitMemMutator.

* Overload IRBuilder::Emit to have 3 styles.

* Update IRBuilder/IRMutator interfaces and passes.

* Add MatchShape binding to IRBuilder.

* Improve IRMutator interface; add Normalize and CanProveShapeEqual to IRBuilder

* Update EmitMatchShape.

Co-authored-by: ZihengJiang <[email protected]>
…ort for call_dps (tlc-pack#15)

* update parser and printer for match_shape

* support parsing class to IRModule, and extern func in call_dps
* [PASS] Shape lowering.

* Update to IRModule based.

* TIR function generation.

* Improve.

* Improve.

* Improve test.

* Improve.

* Address comment.
…lc-pack#19)

* relax call_packed arity, return IRModule factory, print IRModule PrimFuncs

* explicitly parse and print attrs_type_key on calls

* print type even when attrs has no fields
* VM compiler.

* Update.

* Compile IRmodule; expose Python api

* Add dtype constant serialization and type hint.

* Address comments.

* Add todos and fix lint.

* Update

* Update.
* init

* update

* update

* test case working

* update and add multi block test case

* check in

* fixes

* fix

* update

* add

* update

* add

* update

* address comments.

Co-authored-by: Altan Haan <[email protected]>
…ble (tlc-pack#21)

* rebase.

* Update.

* Update shape lowering, make sure the lowering pipeline works.

* Address comment.
* call_dps lowering.

* Improve shape lowering.

* Support alloc_storage for dynamic shape.

* implement ToNonDF to transform program to non-dataflow format.

* Fix the mutator issue.

* Update build api, an issue occurred.

* vm tests can pass.

* Support shape tuple in executable serialization.

* Fix for test.

* Minor fixes.

* Address comments.

* Add mutate binding var back.

* Visit binding var and fix tests.

Co-authored-by: YuchenJin <[email protected]>
slyubomirsky and others added 12 commits October 16, 2022 08:29
It may be useful for some passes to collapse chains of definitions, particularly after other compiler transformations that may reduce or simplify some expressions.

This pass takes chains of definitions and replaces references to later definitions with references to the original one. It works by checking `LookupBinding` for each var use-site and replacing the var with its definition if the definition was another var. (Note: This required updating `BlockBuilder` to also update its binding map for `MatchShape` nodes; that was arguably a bug.) Additionally, `MatchShape` bindings where the `LHS` and the `RHS` are guaranteed to match at compile time are canonicalized into ordinary `VarBinding`s.
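
The chain-collapsing idea can be sketched on a toy IR as follows. This is only an illustration under stated assumptions, not the Relax pass: bindings are `(var, rhs)` pairs, a right-hand side is either a variable name or a tuple `("op", arg, ...)`, and the name `canonicalize_bindings` is hypothetical. The sketch keeps the alias bindings in place and leaves their removal to a separate cleanup step, which is an assumption as well.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical mini-IR: a right-hand side is either a variable name (str)
# or an operator call written as a tuple ("op", arg1, arg2, ...).
Rhs = Union[str, Tuple[str, ...]]
Binding = Tuple[str, Rhs]


def canonicalize_bindings(bindings: List[Binding]) -> List[Binding]:
    """Rewrite every variable use to the variable at the head of its alias chain."""
    alias: Dict[str, str] = {}  # var -> original var it is a copy of

    def resolve(name: str) -> str:
        while name in alias:
            name = alias[name]
        return name

    result: List[Binding] = []
    for var, rhs in bindings:
        if isinstance(rhs, str):
            # `var = other_var`: remember the alias and point at the original.
            root = resolve(rhs)
            alias[var] = root
            result.append((var, root))
        else:
            # Operator call: rewrite each argument to its original definition.
            op, *args = rhs
            result.append((var, (op, *(resolve(a) for a in args))))
    return result


chain = [("y", "x"), ("z", "y"), ("w", ("add", "z", "y"))]
canonical = canonicalize_bindings(chain)
```

After the rewrite, `w` is computed from `x` directly, and `y` and `z` no longer have any uses.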
Fix an incorrect check which disables emitting global MatchShape outside a dataflow block and mistakenly enables emitting dataflow MatchShape outside a dataflow block.
…-pack#247)

This PR makes some small additions to the end-to-end AutoTIR script, namely eliminating a bug (it was incorrectly using the stateful API) and adding an option to save the test results as a CSV file for benchmarking purposes (the data can then be separately analyzed as needed).

These changes also required a small extension to the save_function method in the VM, namely allowing it to take keyword arguments.
Attempting to use `dump_ast` on functions containing the operators `relax.unique` and `relax.print` previously crashed because their attributes' keys could not be queried. It turned out that this was a problem with the operator attributes: they had not been registered on the Python side, so the Python representation treated them as opaque TVM objects. This PR corrects this mistake.
…ck#254)

This small PR changes a check in the tvmscript parser to support empty shape tuples which are used to represent scalars. I added a scalar addition test to make sure it works properly.
…ack#257)

It was observed that closures saved using `save_function` would crash when used over RPC with the `time_evaluator`, whereas using `set_input` and `invoke_stateful` worked as normal. While I am not entirely sure why these failures happened over RPC only in `time_evaluator` (but not in other RPC trials), it became clear that `set_input` performs a conversion of input tensor values in `SetInputTensorWithIndex`, while `save_function` was not doing this. Adding this conversion fixed the observed bug.
This PR adds a `ret_shape` field for specifying the shape of the function's return value. At present, we will not use this information, but by adding it into the AST, we will be able to parse the return shape and use it in the future. Parser V1 in this PR will just always list the `ret_shape` as `RuntimeDepShape`.
Previously, analyses to gather up all variables, free variables, bound variables, all global variables, and all global variables that are called had been implemented in C++ but had not been exposed in Python or tested. This PR exposes these analyses and adds tests for them.

Two further changes:
* The analyses previously ignored variables bound in `MatchShape` nodes; these are now treated as bindings too.
* `rec_global_vars` is renamed `called_global_vars`, since the analysis itself does not check recursion.
* Support Function and If in Normalize pass.

* Use structural equality for expr_memo_.

* Change back to pointer equality for expr_memo_; Add more tests.

* rebase.
It was brought up that Relay lacks an assert operator, so we may as well have one in Relax for debugging. One issue is that we can't name it "`assert`" because Python will treat it as a syntax error to have it as a field name for the "`relax`" module, i.e., `relax.assert` is a syntax error. Thus the op is named "`assert_op`," which is not ideal but serves its purpose.
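
The naming constraint comes from Python itself and can be checked without TVM. The snippet below only parses the expression, so no `relax` module is needed; the helper name `is_valid_attribute_access` is made up for this illustration.

```python
import keyword


def is_valid_attribute_access(name: str) -> bool:
    """Return True if `relax.<name>` parses as a Python expression."""
    try:
        # compile() only parses here; nothing is evaluated or imported.
        compile(f"relax.{name}", "<example>", "eval")
    except SyntaxError:
        return False
    return True


assert_is_keyword = keyword.iskeyword("assert")
plain_name_ok = is_valid_attribute_access("assert")
suffixed_name_ok = is_valid_attribute_access("assert_op")
```

Because `assert` is a reserved keyword, `relax.assert` is rejected by the parser, while `relax.assert_op` is an ordinary attribute access.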
[TVMScript] B4: If branch support (tlc-pack#263)
B8: Local Function Support  (tlc-pack#258)
[TVMScript] B3: Type annotation checks (tlc-pack#256)
[TVMScript][Parser] B1: Dataflow block (tlc-pack#252)
[TVMScript] B2: match shape support (tlc-pack#251)
[TVMScript] B6/B7: Symbolic shape and var shadowing  (tlc-pack#245)
[TVMScript] B5: Support relax op (tlc-pack#244)
[TVMScript] B0: Call_tir support (tlc-pack#243)
enhance parser error reporting (tlc-pack#242)
[TVMScript] A1: Relax Parser infra (tlc-pack#240)
update ci image versions. (tlc-pack#241)
[TVMScript] B2-4: TIR IRBuilder (tlc-pack#239)
[TVMScript] A0: Relax IRBuilder infra (tlc-pack#235)
[TVMScript] B5-6: TIR IRBuilder (tlc-pack#231)
[TVMScript] B1: IRBuilder (tlc-pack#228)
[TVMScript] New Parser: Part C (tlc-pack#218)
[TVMScript] New Parser: Part A (tlc-pack#221)
[TVMScript] New Parser: Part B (tlc-pack#217)

Not recovered:
[Pass] Separate ApplyHistoryBest from tuning passes (tlc-pack#226)
[Bugfix] Couple of bug fixes to run TVM-gen code together with BYOC (tlc-pack#249)

co-authored-by: Yuchen Jin <[email protected]>
co-authored-by: Siyuan Feng <[email protected]>
co-authored-by: Ruihang Lai <[email protected]>
Contributor

@mbaret mbaret left a comment


I think this is largely code from the existing USMP, and the new additions seem in keeping with that code. Minor comments only - further debates on things like testing methodology are probably better had when we take this to main.


Returns
-------
Map<relax::Expr, BufferInfo>
Contributor


Use `Dict[relax.Expr, BufferInfo]` to be more Python-style

/*!
* \file relax/usmp/analysis/extract_buffer_info.cc
*
* \brief This analysis pass consumes a TIR IRModule with a main function
Contributor


Update comment to reflect the Relax/TIR module

ExprVisitor::VisitBinding_(binding);
}

static Integer CalculateRelaxExtentsSize(const DataType& dtype, const Array<PrimExpr>& extents) {
Contributor


Is there a way we can consolidate this with the TIR version?
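
For context, a dtype-and-extents size computation of this kind can be sketched as below. The body of `CalculateRelaxExtentsSize` is not shown in this thread, so this is an assumption about what such a helper typically does (element byte width times the product of constant extents, giving up on dynamic extents). `extents_size_bytes` is a hypothetical name, and `None` stands in for a non-constant extent.

```python
from math import prod
from typing import Optional, Sequence


def extents_size_bytes(dtype_bits: int, extents: Sequence[Optional[int]]) -> Optional[int]:
    """Size in bytes of a buffer, or None if any extent is not a known constant."""
    if any(extent is None for extent in extents):
        # Dynamic shape: the size cannot be computed statically.
        return None
    bytes_per_element = (dtype_bits + 7) // 8  # round sub-byte types up to one byte
    return bytes_per_element * prod(extents)  # prod(()) == 1 covers scalars


float32_2x3 = extents_size_bytes(32, [2, 3])
scalar_int8 = extents_size_bytes(8, [])
dynamic = extents_size_bytes(32, [2, None])
```

If the Relax and TIR variants differ only in how they obtain the dtype and extents, a shared helper with this shape is one way they could be consolidated.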

@mbaret
Contributor

mbaret commented Oct 17, 2022

> My thought is if there are usually not multiple temporary allocs in a TIR PrimFunc, the liveness analysis pass would just need to traverse the Relax function after memory lifting, which would simplify the assumption and reduce the complexity of the liveness analysis pass by a lot. :)

Ethos-U is a motivator for this functionality as it doesn't use metaschedule but does have multiple allocates in a single prim func. Doing buffer consolidation on a per-primfunc basis will also be generally less efficient than doing it with global knowledge where the memory fragmentation pattern is known.
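
One way the gap between per-PrimFunc and global planning can show up is in the peak number of live bytes. The sketch below uses made-up buffer names, sizes, and live ranges, and a hypothetical helper `peak_live_bytes`; it is an illustration of the argument, not a USMP algorithm.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]


def peak_live_bytes(intervals: Dict[str, Interval], sizes: Dict[str, int]) -> int:
    """Largest total size of buffers that are live at the same step."""
    last_step = max(end for _, end in intervals.values())
    return max(
        sum(sizes[name] for name, (start, end) in intervals.items() if start <= step <= end)
        for step in range(last_step + 1)
    )


# Hypothetical sizes in bytes; x and out are Relax-level tensors,
# tmp_a and tmp_b are allocates inside a single PrimFunc.
sizes = {"x": 64, "tmp_a": 128, "tmp_b": 128, "out": 64}

# Per-PrimFunc planning: first consolidate the internal allocates into one workspace...
workspace = peak_live_bytes({"tmp_a": (0, 1), "tmp_b": (2, 3)}, sizes)
# ...then the caller sees x, out, and the workspace all live across the whole call.
per_primfunc = peak_live_bytes(
    {"x": (0, 0), "out": (0, 0), "workspace": (0, 0)},
    {"x": 64, "out": 64, "workspace": workspace},
)

# Global planning: live ranges of all four buffers on one shared timeline.
global_peak = peak_live_bytes(
    {"x": (0, 0), "tmp_a": (0, 1), "tmp_b": (2, 3), "out": (3, 3)}, sizes
)
```

With these numbers the per-PrimFunc view needs 256 bytes at its peak, while the global view needs 192, because `x` is no longer live once `tmp_b` and `out` come into use.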

@gigiblender gigiblender force-pushed the usmp-live-analysis branch 2 times, most recently from f466d20 to 7600f9c Compare October 19, 2022 10:01
This commit changes the behavior of the parser to allow type annotations, as suggested by the community.
The current behavior:
- Use the more refined type/shape between user annotated and deduced type/shape.
The updated behavior:
- Always use user annotations
- Only checks if the type/shape is valid.